In practice, a 256-point FFT is used to approximate the spectra of $\hat{A}(z)$ and $A(z)$.

Transparent quantization of LPC parameters means that the speech reconstructed from the
quantized LSFs is audibly indistinguishable from the original speech produced from the unquantized
LSFs. It can be achieved with 32-34 bits/frame using scalar quantization (SQ) techniques, while
vector quantization (VQ) retains similar quality with 24-26 bits/frame [22]. The VQ approaches
include full-search VQ (FSVQ), split VQ (SVQ), multi-stage VQ (MSVQ), shape-gain VQ (SGVQ),
and tree-search VQ (TSVQ). An FSVQ requires a large codebook to meet the transparency
requirements, which increases the search complexity. However, a structured codebook, such as a
tree-search, split, or multi-stage codebook, reduces the search complexity considerably [11].
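The effect of codebook structure on search cost can be illustrated with a small calculation. The sketch below counts the code vectors that must be compared per frame. The 24 bits/frame figure comes from the text, while the 8 + 8 + 8 bit split and both function names are hypothetical choices made only for illustration.

```python
def full_search_codewords(bits_per_frame: int) -> int:
    """Code vectors held (and searched) by one full-search codebook."""
    return 2 ** bits_per_frame


def split_codewords(bits_per_split: list[int]) -> int:
    """Code vectors held by a split VQ: one small codebook per sub-vector."""
    return sum(2 ** bits for bits in bits_per_split)


# 24 bits/frame is taken from the text; the 8 + 8 + 8 allocation is an assumption.
full = full_search_codewords(24)
split = split_codewords([8, 8, 8])
print("full search:", full, "code vectors")
print("split (8+8+8):", split, "code vectors")
```

Under these assumptions the full-search codebook holds 16,777,216 code vectors, while the three split codebooks hold 768 in total.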

We describe the vector quantization, codebook structure and the generalised Lloyd algorithm for 
codebook design in Section 4.1. The basic concepts behind the SVQ and the SGVQ are described in 
Section 4.2 and Section 4.3 respectively. 

4.1 Vector Quantization 

A vector quantizer (VQ) quantizes a block of input data as a single vector; VQ is therefore
multidimensional, unlike the one-dimensional scalar quantizer. A VQ produces less distortion than a
scalar quantizer for the same number of bits [7]. VQ exploits linear and non-linear dependence among
the vectors to be quantized. VQ allows different cell shapes, such as hexagons, to fill the region
$\Re^K$, whereas in SQ the region $\Re^K$ is filled with rectangular cells. For example, let the region
be a plane, as shown in Figure 4-1 (a). Figure 4-1 (b) and Figure 4-1 (c) depict the rectangular and the
hexagonal partitions of the plane, respectively. For the same number of quantization cells, hexagonal
cells result in a lower worst-case error than rectangular cells for an error criterion such as Euclidean
distance, provided the edge effects are negligible [7]. The advantage of VQ is that it allows cell
shapes, such as hexagons, that fill the region more efficiently than the rectangular cells allowed in SQ.
Figure 4-1 Two-Dimensional Quantization: (a) Plane, (b) Rectangular Partition, (c) Hexagonal Partition.

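Let $\mathbf{y}$ be an input vector of length $K$ defined as

$\mathbf{y} = [y_0 \;\; y_1 \;\; \cdots \;\; y_{K-1}]$ (4-4)

VQ quantizes $\mathbf{y}$ to $\hat{\mathbf{y}}$,

$\hat{\mathbf{y}} = Q(\mathbf{y})$ (4-5)

where $Q(\cdot)$ is the vector quantization operation.

The vector $\hat{\mathbf{y}}$ is chosen from a set of $L$ code words $C = \{\mathbf{c}_i;\ 0 \le i \le L-1\}$ such that the nearest
neighbor rule

$d(\mathbf{y}, \mathbf{c}_i) \le d(\mathbf{y}, \mathbf{c}_k), \quad 0 \le i, k \le L-1,\ i \ne k$ (4-6)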

is satisfied. The parameter $d(\mathbf{x}, \mathbf{z})$ denotes a distortion measure between $\mathbf{x}$ and $\mathbf{z}$, with $\mathbf{x}$ and $\mathbf{z}$
being column vectors of length $K$ each. The set of vectors $C$, called the codebook, consists of $L$ code
vectors and is represented as

$C = [\mathbf{c}_0 \;\; \mathbf{c}_1 \;\; \cdots \;\; \mathbf{c}_{L-1}]$ (4-7)
where $\mathbf{c}_i$ is the code vector of length $K$, defined as
$\mathbf{c}_i = [c_{i,0} \;\; c_{i,1} \;\; \cdots \;\; c_{i,K-1}]$ (4-8)



Thus, a vector quantizer $Q$ of dimension $K \times L$ maps a vector in the $K$-dimensional Euclidean
space $\Re^K$ to the finite codebook $C$ of size $K \times L$ and chooses the code vector that is closest to
the input vector.

$Q: \Re^K \rightarrow C$ (4-9)
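The nearest neighbor rule in (4-6) can be sketched in a few lines of code. The example below is a minimal full-search quantizer that assumes the squared Euclidean distance as the distortion measure. The function names and the toy codebook ($L = 4$, $K = 2$) are illustrative assumptions, not part of the coder described here.

```python
def squared_error(x, z):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, z))


def quantize(y, codebook):
    """Full search: return (index, code vector) of the codeword nearest to y.

    Ties are resolved in favour of the lowest index.
    """
    index = min(range(len(codebook)), key=lambda i: squared_error(y, codebook[i]))
    return index, codebook[index]


# Hypothetical toy codebook with L = 4 code vectors of length K = 2.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
index, y_hat = quantize((0.9, 0.2), codebook)
print(index, y_hat)
```

Only the index needs to be transmitted; the decoder recovers the code vector by looking the index up in its own copy of the codebook.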
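The resolution (rate) $r$ is the number of bits per vector component used to represent the input vector,

$r = \frac{B}{K}$ (4-10)

where $B$ is the number of bits needed to address the code words in $C$. VQ achieves fractional values
of resolution, as defined in (4-10), which are essential for low-bit-rate applications [7].

A distortion measure $d(\mathbf{y}, \hat{\mathbf{y}})$ associated with quantizing any input vector $\mathbf{y}$ to $\hat{\mathbf{y}}$ can be used to
evaluate the performance of a system. A quantizer is good if the average distortion is small. The most
common distance measures are the squared error (SE) distance measure, denoted by $d_{SE}(\mathbf{y}, \hat{\mathbf{y}})$, the
mean-square error (MSE) distance measure, denoted by $d_{MSE}(\mathbf{y}, \hat{\mathbf{y}})$, and the weighted mean-square
error (WMSE) distance measure, denoted by $d_{WMSE}(\mathbf{y}, \hat{\mathbf{y}})$. These distance measures are respectively
defined as follows:

$\mathrm{SE} = d_{SE}(\mathbf{y}, \hat{\mathbf{y}}) = (\mathbf{y} - \hat{\mathbf{y}})^T (\mathbf{y} - \hat{\mathbf{y}}) = \sum_{k=0}^{K-1} (y_k - \hat{y}_k)^2$ (4-11)

$\mathrm{MSE} = d_{MSE}(\mathbf{y}, \hat{\mathbf{y}}) = \frac{1}{K} (\mathbf{y} - \hat{\mathbf{y}})^T (\mathbf{y} - \hat{\mathbf{y}})$ (4-12)

$\mathrm{WMSE} = d_{WMSE}(\mathbf{y}, \hat{\mathbf{y}}) = \frac{1}{K} (\mathbf{y} - \hat{\mathbf{y}})^T \mathbf{W} (\mathbf{y} - \hat{\mathbf{y}})$ (4-13)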
where $\mathbf{W}$ is a symmetric, positive definite weighting matrix of size $K \times K$ [7], and $\mathbf{y}$ and $\hat{\mathbf{y}}$ are
column vectors of the same length. The WMSE measure reduces to the MSE measure for $\mathbf{W} = \mathbf{I}$, the
identity matrix. In speech and image compression applications, a matrix $\mathbf{W}$ that depends explicitly on
the input vector $\mathbf{y}$ to be quantized is chosen to obtain perceptually motivated distortion measures [7].

For example, let $\mathbf{W}(\mathbf{y})$ be $\|\mathbf{y}\|^{-2}\,\mathbf{I}$, where $\mathbf{I}$ is the identity matrix and $\|\mathbf{y}\| > 0$; then (4-13) becomes
the ratio of noise energy to signal energy, scaled by $1/K$,

$d_{WMSE}(\mathbf{y}, \hat{\mathbf{y}}) = \frac{1}{K} \frac{\|\mathbf{y} - \hat{\mathbf{y}}\|^2}{\|\mathbf{y}\|^2}$ (4-14)

For a given noise energy $\|\mathbf{y} - \hat{\mathbf{y}}\|^2$, $d_{WMSE}(\mathbf{y}, \hat{\mathbf{y}})$ is higher when $\|\mathbf{y}\|$ is small than when it is large.
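The three distortion measures can be compared numerically. The sketch below assumes that the $1/K$ normalization of the MSE also applies to the WMSE, so that the WMSE with an identity weighting matrix equals the MSE. The function names and test vectors are hypothetical.

```python
def se(y, y_hat):
    """Squared error: (y - y_hat)^T (y - y_hat)."""
    return sum((a - b) ** 2 for a, b in zip(y, y_hat))


def mse(y, y_hat):
    """Mean-square error: squared error divided by the vector length K."""
    return se(y, y_hat) / len(y)


def wmse(y, y_hat, W):
    """Weighted MSE: (1/K) e^T W e with e = y - y_hat (1/K factor assumed)."""
    e = [a - b for a, b in zip(y, y_hat)]
    We = [sum(W[r][c] * e[c] for c in range(len(e))) for r in range(len(e))]
    return sum(e[r] * We[r] for r in range(len(e))) / len(e)


def energy_weight(y):
    """Input-dependent weighting W(y) = I / ||y||^2."""
    energy = sum(v * v for v in y)
    size = len(y)
    return [[1.0 / energy if r == c else 0.0 for c in range(size)] for r in range(size)]


# Same noise energy (1.0) added to a high-energy and a low-energy vector.
loud, loud_hat = [3.0, 4.0], [2.0, 4.0]
quiet, quiet_hat = [0.75, 1.0], [-0.25, 1.0]
print(mse(loud, loud_hat), mse(quiet, quiet_hat))
print(wmse(loud, loud_hat, energy_weight(loud)))
print(wmse(quiet, quiet_hat, energy_weight(quiet)))
```

The MSE is identical for both vectors, whereas the energy-weighted measure penalizes the same noise more heavily on the low-energy vector.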

4.1.1 Codebook Structure 

The codebook $C$ of dimension $K \times L$ is obtained by training with a large number of input
vectors, called the training vectors. The training data is partitioned into $L$ Voronoi
regions or cells $R_i$ in the $K$-dimensional space $\Re^K$ such that no two regions overlap [7],

$\bigcup_{i} R_i = \Re^K; \quad R_i \cap R_j = \emptyset, \quad i \ne j$ (4-15)

The centroid $\mathbf{c}_i$ of the Voronoi region $R_i$ is the code word representing $R_i$. The Voronoi region $R_i$
[10] [17] associated with a code vector $\mathbf{c}_i$ is defined as

$R_i = \{\mathbf{y} \in \Re^K : \|\mathbf{y} - \mathbf{c}_i\| \le \|\mathbf{y} - \mathbf{c}_j\|;\ 0 \le j \le L-1,\ j \ne i\}$ (4-16)

VQ quantizes $\mathbf{y}$ to $\mathbf{c}_i$ if $\mathbf{y} \in R_i$.
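The relation between Voronoi cells and centroids can be sketched as a single training pass: each training vector is assigned to the cell of its nearest code vector, and each code vector is then replaced by the centroid of its cell. This is only an illustration under the Euclidean distance. The function names, the rule of keeping the old code vector for an empty cell, and the toy data are assumptions.

```python
def nearest_index(y, codebook):
    """Index of the code vector closest to y in squared Euclidean distance."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(y, c))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def partition(training, codebook):
    """Split the training vectors into one cell per code vector."""
    cells = [[] for _ in codebook]
    for y in training:
        cells[nearest_index(y, codebook)].append(y)
    return cells


def centroid(cell):
    """Component-wise mean of the vectors in a non-empty cell."""
    return tuple(sum(v[k] for v in cell) / len(cell) for k in range(len(cell[0])))


def update_codebook(training, codebook):
    """One partition/centroid pass; empty cells keep their old code vector."""
    cells = partition(training, codebook)
    return [centroid(cell) if cell else old for cell, old in zip(cells, codebook)]


training = [(0.0, 0.0), (0.0, 2.0), (4.0, 0.0), (4.0, 2.0)]
codebook = [(1.0, 1.0), (3.0, 1.0)]
print(update_codebook(training, codebook))
```

Repeating this pass until the average distortion stops decreasing is the basic idea behind iterative codebook design.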

4.1.2 Codebook Initializations 

A codebook can be initialized in different ways. The initialization of the codebook affects the 
overall performance of the resulting codebook. An initial codebook can be obtained by randomly 
Chapter 2 Linear Prediction of Speech



In practical speech coding scenarios, speech is usually sampled at 8 kHz. Speech has at most four
recognizable formant frequencies when it is sampled at 8 kHz. Since each formant corresponds to a
pair of poles, we attempt to model speech with an AR process of at least 8th order. Most speech
coding applications use 10th-order AR filters. Speech is usually considered to be stationary within
a 16 to 32 ms interval, leading to a corresponding analysis interval of 128 to 256 samples [5].
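The analysis interval quoted above follows directly from the sampling rate. A minimal check, with a hypothetical helper name:

```python
def frame_length(duration_ms: int, sample_rate_hz: int = 8000) -> int:
    """Number of samples in an analysis frame of the given duration."""
    return duration_ms * sample_rate_hz // 1000


for duration_ms in (16, 32):
    print(duration_ms, "ms ->", frame_length(duration_ms), "samples")
```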

Linear Predictive Coding (LPC), discussed in Section 2.1, is one of the analysis techniques commonly
used in speech coding. Several procedures, including the auto-correlation, covariance, and recursive
least squares algorithms, can provide estimates of the AR coefficients of the speech model.
The proposed speech coder uses the Recursive Least Squares (RLS) technique, which we describe in
Section 2.2. However, the RLS algorithm must compute the inverse of the auto-correlation matrix of
the input data, so its computational complexity is on the order of the square of the order of the
filter [1].

In speech, the formants are widely separated in frequency: they occur in the neighborhood of [400
1000 1600 2400] Hz, and consequently the corresponding pairs of poles are dominant only in certain
frequency bands. This property makes it feasible to model speech as if it were generated by
cascaded AR sections of lower (here, second) order [4]. The Cascaded RLS with Subsection
Adaptation (CRLS-SA) algorithm [4], which adapts the AR filter coefficients of the cascaded model,
is an RLS-based technique that significantly reduces the required computational effort relative to
the RLS algorithm, as explained in Section 2.3. The CRLS-SA takes advantage of the fact that, for
inverse filtering applications, the gradients of each section in the cascade are almost uncorrelated
with the gradients in other sections. In CRLS-SA the gradient auto-correlation matrix is assumed to
be block diagonal, involving only 2x2 gradient auto-correlation matrices. This assumption reduces
the computational complexity of RLS.
2.1 Linear Prediction

We assume that the speech signal $y_n$ is an $N$th-order AR process represented by

$y_n = \sum_{k=1}^{N} a_{n,k}\, y_{n-k} + u_n$ (2-1)

where $u_n$ is the white noise excitation of the AR model, with its coefficient vector given by

$[1 \;\; -a_{n,1} \;\; -a_{n,2} \;\; \cdots \;\; -a_{n,N}]^T = [1 \;\; -\mathbf{a}_n^T]^T$ (2-2)

$\mathbf{a}_n = [a_{n,1} \;\; a_{n,2} \;\; \cdots \;\; a_{n,N}]^T$ (2-3)

and $y_n$ is the speech sample. The subscript $n$ in all the parameters represents the time index. Let
$\mathbf{y}_n$ be a speech input vector with past values given by

$\mathbf{y}_n = [y_{n-1} \;\; y_{n-2} \;\; \cdots \;\; y_{n-N}]^T$ (2-4)

Linear Prediction (LP) is a method of predicting an unknown signal from its past. Suppose $y_n$ is the
output of an unknown system with some unknown input $u_n$. Then, from (2-1), the output $y_n$ is
estimated from the past outputs. Let $\hat{y}_n$ be the estimated value of $y_n$ at time $n$, such that

$\hat{y}_n = \sum_{k=1}^{N} \hat{a}_{n,k}\, y_{n-k}$ (2-5)

where

$\hat{\mathbf{a}}_n = [\hat{a}_{n,1} \;\; \hat{a}_{n,2} \;\; \cdots \;\; \hat{a}_{n,N}]^T$ (2-6)

are the estimated LPC coefficients defining $\hat{A}_n(z)$:

$\hat{A}_n(z) = \sum_{k=1}^{N} \hat{a}_{n,k}\, z^{-k}$ (2-7)



Since $\hat{y}_n$ is estimated from the information available at time $(n-1)$, this operation is termed
forward linear prediction and is depicted in Figure 2-1. The forward prediction error $e_n^f$ is the
difference between the input sample $y_n$ and its predicted value $\hat{y}_n$, formulated as

$e_n^f = y_n - \hat{y}_n$ (2-8)
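A single step of forward prediction can be worked through numerically. The sketch below assumes a second-order predictor with fixed, made-up coefficients; the function name and all sample values are hypothetical.

```python
def forward_predict(past, coeffs):
    """Predict y_n as sum_k a_k * y_{n-k}.

    past   = [y_{n-1}, y_{n-2}, ..., y_{n-N}]
    coeffs = [a_1, a_2, ..., a_N]
    """
    return sum(a * y for a, y in zip(coeffs, past))


coeffs = [0.5, 0.25]   # assumed predictor coefficients a_1, a_2
past = [0.5, 1.0]      # y_{n-1}, y_{n-2}
y_n = 0.625            # current sample

y_hat = forward_predict(past, coeffs)   # 0.5 * 0.5 + 0.25 * 1.0
error = y_n - y_hat                     # forward prediction error
print(y_hat, error)
```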





Figure 2-1 AR Process and Linear Prediction. 



In other words, the linear prediction process finds an estimate $\hat{A}_n(z)$ of $A_n(z)$ by filtering
$y_n$ with the inverse filter and minimizing the error according to some criterion.

The operation of prediction error filtering applied to a stationary process $\{y_n\}$ is termed the
analysis process. We define an $N$th-order prediction error filter, also termed the analysis filter,
$H_a(z)$ as

$H_a(z) = 1 - \hat{A}_n(z)$ (2-9)

where $\hat{A}_n(z)$ is as defined in (2-7). In theory, as the order of the prediction error filter increases, the
correlation between input samples is reduced and eventually, when the order becomes high enough, the
output process consists of a sequence of uncorrelated samples [1]. A sequence of uncorrelated
random variables is called white noise. A white-noise process, denoted by $\{w_n\}$, has zero mean and
variance $\sigma_w^2$, such that

$E[w_i w_j] = \begin{cases} \sigma_w^2, & i = j \\ 0, & i \ne j \end{cases}$ (2-10)


The process of converting a correlated input into a white-noise output is termed the whitening process.
When the output becomes white, the input process $\{y_n\}$ can be represented by the analysis filter
coefficients and the prediction error power $P_M = \sigma_w^2$.

The auto-regressive modeling of the stationary process $\{y_n\}$ is termed the synthesis process. The
analysis process and the synthesis process are complementary. In other words, with a white-noise
process $\{w_n\}$ of zero mean and variance $\sigma_w^2$ at the input, the inverse analysis filter produces the
stationary process $\{y_n\}$. The inverse analysis filter is termed the synthesis filter $H_s(z)$,

$H_s(z) = [1 - \hat{A}_n(z)]^{-1}$ (2-11)

The analysis filter is an all-zero filter with an impulse response of finite duration. Conversely, the
synthesis filter is an all-pole filter with an impulse response of infinite duration. The zeros of the
analysis filter lie inside the unit circle and are located at exactly the same positions as the poles of
the synthesis filter. This ensures that the analysis filter and its inverse, the synthesis filter, are both
stable. Thus, the filters exhibit the minimum-phase property.
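The complementary nature of the two filters can be checked with a short round trip: an excitation passed through the all-pole synthesis filter and then through the all-zero analysis filter should be recovered unchanged. The sketch assumes fixed (time-invariant) coefficients and zero initial conditions; the coefficient values and function names are hypothetical, chosen so that both poles lie inside the unit circle.

```python
def analysis_filter(y, a):
    """All-zero filter 1 - A(z): e_n = y_n - sum_k a_k * y_{n-k}."""
    e = []
    for n in range(len(y)):
        prediction = sum(a[k - 1] * y[n - k] for k in range(1, len(a) + 1) if n - k >= 0)
        e.append(y[n] - prediction)
    return e


def synthesis_filter(e, a):
    """All-pole filter 1 / (1 - A(z)): y_n = e_n + sum_k a_k * y_{n-k}."""
    y = []
    for n in range(len(e)):
        feedback = sum(a[k - 1] * y[n - k] for k in range(1, len(a) + 1) if n - k >= 0)
        y.append(e[n] + feedback)
    return y


a = [0.5, 0.25]                    # assumed coefficients; poles near 0.81 and -0.31
excitation = [1.0, 0.0, 0.0, 0.0]  # unit impulse
speech = synthesis_filter(excitation, a)
residual = analysis_filter(speech, a)
print(speech)
print(residual)
```

The synthesis output keeps ringing after the impulse, which reflects the infinite impulse response, while the analysis filter collapses it back to the original impulse.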

In summary, given an AR speech model of order $N$, LPC analysis minimizes the residual error $e_n$,
resulting in the filter $[1 - \hat{A}_n(z)]^{-1}$, where

$e_n = y_n - (\hat{a}_{n,1}\, y_{n-1} + \hat{a}_{n,2}\, y_{n-2} + \cdots + \hat{a}_{n,N}\, y_{n-N})$ (2-12)
and $[\hat{a}_{n,1} \;\; \hat{a}_{n,2} \;\; \cdots \;\; \hat{a}_{n,N}]$ are the estimated LPC parameters.
ANNOUNCEMENTS



October 30, 2002 



CSE 291 LECTURE NOTES 




I'm handing out a sample solution for the first assignment (thanks Bianca!) and the new third
assignment. This assignment is more open-ended and closer to how you might formulate a research
project.

Today's notes are based on Chapter 3 of Hastie/Tibshirani/Friedman. They use facts from linear algebra 
which are not discussed explicitly in the chapter. We'll try to state these during the lecture. 



LINEAR REGRESSION 

Suppose we want to learn how a random variable Y depends on random variables X1 ... Xp. This is a
multivariate scenario unlike the univariate situations we've seen so far. The model we assume is linear:

E[Y|X] = f(X) = b0 + SUM_j Xj * bj

The parameters we want to estimate (also called coefficients) are the p+1 scalars bj. 

Suppose we have N training examples (subscript i). Each training example is a p-dimensional column 
vector (subscript j) along with a scalar. The components of a training example are called independent 
variables, features, attributes, or predictors. These are all synonyms! 

The most popular estimation method is called "least squares." We pick the b vector to minimize 

RSS(b) = SUM_i (f(xi) - yi)^2

"RSS" stands for "residual sum of squares." Note that this minimization treats all the xi equally, and 
that it penalizes large deviations dramatically. Background knowledge about the real-world scenario 
may imply that these choices are not appropriate. But we will see soon that least squares gives 
minimum-variance unbiased estimates. 

For now let's look at least squares algorithmically and geometrically. 



MINIMIZING RSS 

Let X be a matrix where each row is one training example, and the first column is all ones, so the size of 
X is N by p+1 . Let y be the column vector of observed values for the dependent variable. In matrix 
notation 
RSS(b) = (y - Xb)^T (y - Xb)

To explain this in detail: Xb is a matrix times a column vector, so we take the dot-product of each row 
of X with b. This gives a column vector of size N, which we can subtract from y, giving e = y-Xb. Now 
we take the dot product of e with itself by first converting it into a row vector with the transpose 
operation, indicated by the superscript T. 

Note that y and X are fixed and the result RSS(b) is a scalar function of the p+1 parameters that are
components of the vector b. To minimize RSS(b) we set its derivative to zero. The derivative is

d/db RSS(b) = -2 X^T (y - Xb)

This result can be proved by going back to the non-matrix formulation and computing the derivative of
the RSS sum using standard calculus. The second derivative is

d/db [-2 X^T (y - Xb)] = 2 X^T X

We can solve the equation -2 X^T (y - Xb) = 0 using the matrix inverse:

2 X^T y = 2 X^T X b
(X^T X)^-1 X^T y = b

This solution is only valid if the inverse actually exists, i.e. the matrix X^T X is non-singular. Note that
X^T X is square, of size p+1 by p+1.

We can also prove that any solution of -2 X^T (y - Xb) = 0 minimizes the RSS using linear algebra; see
Appendix A of Silvey.
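Here is a small sketch of the normal-equations solution in plain Python, in case it helps to see the steps spelled out. It builds X with a leading column of ones, forms X^T X and X^T y, and solves the linear system by Gaussian elimination rather than forming the inverse explicitly. The helper names and the toy data (points on the line y = 1 + 2x) are made up for illustration.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [list(map(float, row)) + [float(v)] for row, v in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular; no unique solution")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def least_squares(features, y):
    """Return b solving (X^T X) b = X^T y, where X has a leading column of ones."""
    X = [[1.0] + list(row) for row in features]
    columns = list(zip(*X))
    XtX = [[sum(u * v for u, v in zip(ci, cj)) for cj in columns] for ci in columns]
    Xty = [sum(u * v for u, v in zip(ci, y)) for ci in columns]
    return solve(XtX, Xty)


# Toy data lying exactly on y = 1 + 2x, so the fit should recover b0 = 1, b1 = 2.
b = least_squares([[0.0], [1.0], [2.0], [3.0]], [1.0, 3.0, 5.0, 7.0])
print(b)
```

In practice a library routine based on a QR or SVD factorization is usually preferred for numerical stability; the version above is only meant to mirror the algebra.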

LINEAR DEPENDENCE 

Suppose the columns of X are not linearly independent, e.g. one feature is a linear combination of other
features. This will happen for example if we use a "one of n" coding for a feature that is intrinsically
discrete. Then X^T X is singular and the least-squares coefficients b are not defined uniquely. This
makes sense: you can get equally good predictions for y from alternative linear functions of the input
vector. But the predicted values y-hat are still uniquely defined.

Linear dependence between columns will also happen when the number of rows (i.e. training examples)
is less than the number of columns (i.e. features).
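To see the "one of n" problem concretely, here is a small sketch with a two-valued feature coded as two indicator columns next to the all-ones column. The two indicators always sum to the ones column, so X^T X has determinant zero; dropping one indicator removes the dependence. The data and helper names are made up for illustration, and integer arithmetic keeps the determinants exact.

```python
def gram(X):
    """Return X^T X for a matrix given as a list of rows."""
    columns = list(zip(*X))
    return [[sum(u * v for u, v in zip(ci, cj)) for cj in columns] for ci in columns]


def det(M):
    """Determinant by cofactor expansion (fine for tiny matrices)."""
    if len(M) == 1:
        return M[0][0]
    total = 0
    for j, value in enumerate(M[0]):
        minor = [row[:j] + row[j + 1:] for row in M[1:]]
        total += (-1) ** j * value * det(minor)
    return total


# Columns: ones, indicator "feature == A", indicator "feature == B".
X_full = [[1, 1, 0], [1, 0, 1], [1, 1, 0], [1, 0, 1]]
# Same data with the second indicator dropped.
X_reduced = [[1, 1], [1, 0], [1, 1], [1, 0]]

print(det(gram(X_full)))     # zero: X^T X is singular
print(det(gram(X_reduced)))  # non-zero: the inverse exists
```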